# Accessing images from the loc.gov JSON API for image analysis

The digital collections' images available from the [Library of Congress website](https://www.loc.gov) are an amazing resource for study and analysis using digital methods. This notebook shows how you can use the loc.gov JSON API and Python to access sets of images.

More information about the API is at [About the loc.gov JSON API](https://libraryofcongress.github.io/data-exploration/). For more example code in Jupyter notebooks, check out [LC for Robots](https://labs.loc.gov/lc-for-robots).

### Rights and access
Rights and restrictions, including copyright, affect how you can use images, particularly if you want to publish, display, or otherwise distribute them. You can read more in [About Copyright and the Collections](https://www.loc.gov/legal/) and [Copyright and Other Restrictions That Apply to Publication/Distribution of Images: Assessing the Risk of Using a P&P Image](http://www.loc.gov/rr/print/195_copr.html). There is also information about rights specific to the collection and item in several fields in the API response. See the end of this notebook for relevant fields.

### Image sizes and information about collections
In many cases, the images available via the API are a small thumbnail (150px on one side). Consider whether this is sufficient for your research methods and needs. If you need to work with higher resolution images than are available via the API, you'll need to consult with staff who work with the collection(s) to determine whether those are available for use on-site at the Library of Congress and what restrictions apply. Here are some examples of images that are 150px on one side: 

![https://www.loc.gov/item/2017691815/](https://cdn.loc.gov/service/pnp/fsa/8b02000/8b02800/8b02895_150px.jpg)
![https://www.loc.gov/item/2011645392/](https://cdn.loc.gov/service/pnp/ppmsca/31200/31267_150px.jpg)

It can also be helpful to discuss with staff your intended analysis, as factors such as the sources of metadata and how items were digitized may affect your findings. The [Ask a Librarian](http://www.loc.gov/rr/askalib/) service can help you get in touch with staff who work with the collections.

### More APIs for working with images
Some of the Library of Congress images, including those of newspapers in the [Chronicling America](https://chroniclingamerica.loc.gov/) collection, are available from an [IIIF API](http://iiif.io/). This is an API that allows you to zoom, rotate, resize, and work with images in more complex ways. For a brief intro to using the IIIF API, see [this Jupyter notebook on IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/master/loc.gov%20IIIF%20API/IIIF.ipynb). 
 

## Identifying items 

So let's get started accessing images! If you haven't already, browse or search the [Library of Congress Digital Collections](https://www.loc.gov/collections) website to see what is available and refine your search parameters. The website lets you search by keyword, format, and filter. You can also explore images by collection. While the API documentation has [detailed information on parameters](https://libraryofcongress.github.io/data-exploration/requests.html) you can use in your search, the website often gives you good context that is useful in understanding the collections and content.

Once you've found a search that targets the items that interest you, **copy the URL**. That will be the **base URL** for your API request. 

For example: 

``https://www.loc.gov/collections/baseball-cards/``

``https://www.loc.gov/photos/?q=bridges&dates=1800%2F1899``

### Calling the API and retrieving image URLs

The function below will call the API, adding the following parameters:
* ``fo=json`` to get JSON format in response
* ``c=100`` a count of 100 results in each response, rather than the default 25
* ``at=results,pagination`` provide only the ``results`` and ``pagination`` parts of the response. The API response is otherwise very long with information we don't need.
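
As a rough sketch of what these parameters do to the base URL, the snippet below appends them using only the standard library. The helper name `build_api_url` is hypothetical and not part of the loc.gov API; the function later in this notebook lets `requests` build an equivalent query string from its `params` argument.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_api_url(base_url, count=100):
    """Append the JSON API parameters to a loc.gov search or collection URL.

    build_api_url is a hypothetical helper used only for illustration.
    """
    parts = urlsplit(base_url)
    # keep any search parameters already present, e.g. q=bridges
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("fo", "json"), ("c", str(count)), ("at", "results,pagination")]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(build_api_url("https://www.loc.gov/collections/baseball-cards/"))
print(build_api_url("https://www.loc.gov/photos/?q=bridges&dates=1800%2F1899"))
```

Note that `urlencode` percent-encodes the comma in ``results,pagination`` as ``%2C``, which is an equivalent way of writing the same value.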

The response will have a field called ``image_url``: 

``
"image_url": [
"//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332_150px.jpg",
"//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332_150px.jpg#h=150&w=100",
"//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332t.gif#h=150&w=100",
"//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332r.jpg#h=640&w=425",
"//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332v.jpg#h=1024&w=680"
],
``

The last one listed is usually the largest of the image files publicly available, so the function retrieves the last item in the ``image_url`` array. 
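
Because the last entry is only *usually* the largest, you may prefer to check. The sketch below reads the ``#h=...&w=...`` fragment shown in the sample response and picks the entry with the largest pixel area, falling back to the last entry when no sizes are given. The helper names are hypothetical, and the sketch assumes the fragment always uses the ``h`` and ``w`` keys seen above.

```python
from urllib.parse import parse_qs, urlsplit

def parse_dimensions(image_url):
    """Return (height, width) from a '#h=..&w=..' fragment, or None if absent."""
    values = parse_qs(urlsplit(image_url).fragment)
    if "h" in values and "w" in values:
        return int(values["h"][0]), int(values["w"][0])
    return None

def largest_image(image_urls):
    """Pick the URL with the largest pixel area; fall back to the last URL."""
    sized = [(parse_dimensions(u), u) for u in image_urls]
    sized = [(dims, u) for dims, u in sized if dims is not None]
    if not sized:
        return image_urls[-1]
    return max(sized, key=lambda pair: pair[0][0] * pair[0][1])[1]

# the image_url array from the sample response above
sample = [
    "//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332_150px.jpg",
    "//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332_150px.jpg#h=150&w=100",
    "//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332t.gif#h=150&w=100",
    "//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332r.jpg#h=640&w=425",
    "//cdn.loc.gov/service/pnp/cph/3f00000/3f05000/3f05300/3f05332v.jpg#h=1024&w=680",
]
print(largest_image(sample))
```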

In [1]:
import requests

def get_image_urls(url, items=None):
    '''
    Retrieves the image URLs for items that have public URLs available. 
    Skips over items that are for the collection as a whole or web pages about the collection.
    Handles pagination. 
    '''
    # start a fresh list on each top-level call; a mutable default ([]) would be shared between calls
    if items is None:
        items = []
    # request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    data = call.json()
    results = data['results']
    for result in results:
        # original_format may be missing, so fall back to an empty list
        formats = result.get("original_format") or []
        # don't try to get images from the collection-level result
        if "collection" not in formats and "web page" not in formats:
            # take the last URL listed in the image_url array
            if result.get("image_url"):
                item = result.get("image_url")[-1]
                items.append(item)
    if data["pagination"]["next"] is not None: # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        print("getting next page: {0}".format(next_url))
        get_image_urls(next_url, items)

    return items
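
The function above calls itself once per page. For very large result sets, a loop may be a safer way to follow ``pagination.next``. The sketch below is one possible iterative variant, not the notebook's own method: `collect_image_urls` and the `fetch_json` parameter are hypothetical names, and passing the fetcher in makes it possible to try the logic on canned data without calling the API.

```python
def fetch_with_requests(url, params):
    """Fetch one page of API results as a dict (assumes requests is installed)."""
    import requests  # imported here so the rest of the sketch runs without it
    return requests.get(url, params=params, timeout=30).json()

def collect_image_urls(url, fetch_json=fetch_with_requests):
    """Follow pagination with a loop and return the last image_url of each item."""
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    items = []
    while url:
        data = fetch_json(url, params)
        for result in data.get("results", []):
            formats = result.get("original_format") or []
            # skip collection-level results and web pages, as in the function above
            if "collection" in formats or "web page" in formats:
                continue
            if result.get("image_url"):
                items.append(result["image_url"][-1])
        # "next" is None on the last page, which ends the loop
        url = data.get("pagination", {}).get("next")
    return items

# try the loop on two canned pages instead of the live API
fake_pages = {
    "page1": {"results": [{"original_format": ["photo, print, drawing"],
                           "image_url": ["small", "large"]},
                          {"original_format": ["collection"], "image_url": ["x"]}],
              "pagination": {"next": "page2"}},
    "page2": {"results": [{"image_url": ["only"]}, {"original_format": ["web page"]}],
              "pagination": {"next": None}},
}
print(collect_image_urls("page1", fetch_json=lambda u, p: fake_pages[u]))
```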


The Library of Congress has a digitized collection of baseball cards from the late 19th and early 20th centuries. Here's an example:

!["https://www.loc.gov/collections/static/baseball-cards/images/Ward0006.jpg"](https://www.loc.gov/collections/static/baseball-cards/images/Ward0006.jpg)

Let's retrieve URLs for images in the Baseball Cards collection:

In [2]:
image_urls = get_image_urls("https://www.loc.gov/collections/baseball-cards/", items=[])

getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=2
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=3
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=4
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=5
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=6
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=7
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=8
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=9
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=10
getting next page: https://

How many URLs were retrieved?

In [3]:
len(image_urls)

2085

Here are the URLs retrieved from the ``image_url`` field in the first five results in the API response. As the ``#h=1024`` fragment at the end of each URL shows, each of these images is 1024 pixels tall.

In [4]:
image_urls[:5]

['//cdn.loc.gov/service/pnp/bbc/0000/0000/0001fv.jpg#h=1024&w=575',
 '//cdn.loc.gov/service/pnp/bbc/0000/0000/0002fv.jpg#h=1024&w=569',
 '//cdn.loc.gov/service/pnp/bbc/0000/0000/0003fv.jpg#h=1024&w=538',
 '//cdn.loc.gov/service/pnp/bbc/0000/0000/0004fv.jpg#h=1024&w=564',
 '//cdn.loc.gov/service/pnp/bbc/0000/0000/0005fv.jpg#h=1024&w=539']

You could pass this list of URLs to a tool like wget to collect the images. A helpful tutorial is [Automated Downloading with wget](https://programminghistorian.org/lessons/automated-downloading-with-wget). 
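
The URLs in the list are scheme-relative (they start with ``//``) and carry a size fragment, so they need a little cleanup before a download tool can use them. A minimal sketch, assuming you want one ``https`` URL per line in a text file; the helper names and the file name `image_urls.txt` are hypothetical.

```python
from urllib.parse import urldefrag

def to_download_url(image_url):
    """Drop the '#h=..&w=..' fragment and add the https scheme if it is missing."""
    url, _fragment = urldefrag(image_url)
    if url.startswith("//"):
        url = "https:" + url
    return url

def write_url_list(image_urls, list_path):
    """Write one cleaned URL per line to list_path."""
    with open(list_path, "w", encoding="utf-8") as f:
        for image_url in image_urls:
            f.write(to_download_url(image_url) + "\n")

print(to_download_url("//cdn.loc.gov/service/pnp/bbc/0000/0000/0001fv.jpg#h=1024&w=575"))
```

After calling `write_url_list(image_urls, "image_urls.txt")`, the file could be passed to wget with its ``-i`` (input file) option, for example ``wget -i image_urls.txt -P images``.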

### Rights and restrictions data in the API

The API response also contains data about rights and restrictions, just as seen on the loc.gov website. These fields are available in the API response for a single item. 

To make an API request for a single item, use the URL in the ``id`` field of the results section and add ``fo=json``. 

We'll request item information for a specific item in the Works Progress Administration Posters collection and look at the rights associated with it: 

In [5]:
import json

# request JSON for a single item
r = requests.get("https://www.loc.gov/item/98513999/?fo=json")
r_data = r.json()

**rights_information**

In [6]:
print(json.dumps(r_data["item"]["rights_information"], indent=2))

"No known restrictions on publication."


**rights_advisory**

In [7]:
print(json.dumps(r_data["item"]["rights_advisory"], indent=2))

"No known restrictions on publication."


**rights**

In [8]:
print(json.dumps(r_data["item"]["rights"], indent=2))

[
  "<p>The Library of Congress does not&nbsp;own rights to material  in its collections. Therefore, it does not license or charge permission fees  for use of such material and cannot grant or deny permission to publish or  otherwise distribute the material. </p>\n<p>Ultimately, it is the researcher's obligation to assess copyright or other use restrictions and  obtain permission from third parties when necessary before publishing or  otherwise distributing materials found in the Library's collections. </p>\n<p>For information about reproducing,  publishing, and citing material from this collection, as well as access to the  original items, see: <a href=\"//www.loc.gov/rr/print/res/217_wpa.html\">Work Projects Administration Posters - Rights and Restrictions Information</a> </p>\n \n"
]


**restriction**

Restriction is often blank.

In [9]:
print(json.dumps(r_data["item"]["restriction"], indent=2))

""
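
If you are checking rights for many items, it may help to gather the four fields shown above into one small record. The sketch below works on the ``item`` dictionary from an API response and strips the HTML markup seen in the ``rights`` field. The helper names are hypothetical, the tag stripping is a simple regular expression rather than a full HTML parser, and the sample dictionary is made up for illustration.

```python
import html
import re

RIGHTS_FIELDS = ("rights_information", "rights_advisory", "rights", "restriction")

def plain_text(fragment):
    """Remove HTML tags and entities and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(without_tags).split())

def summarize_rights(item):
    """Return the rights-related fields of an item dict as plain text."""
    summary = {}
    for field in RIGHTS_FIELDS:
        value = item.get(field)  # a field may be missing for some items
        if isinstance(value, list):
            value = [plain_text(v) for v in value]
        elif isinstance(value, str):
            value = plain_text(value)
        summary[field] = value
    return summary

# a made-up item dictionary with the same shape as the fields shown above
example_item = {
    "rights_advisory": "No known restrictions on publication.",
    "rights": ["<p>The Library does not&nbsp;own  rights.</p>\n"],
    "restriction": "",
}
print(summarize_rights(example_item))
```

With a live response, you could call `summarize_rights(r_data["item"])`.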


## Accessing images

Now that you have the URLs for images, you can access them as part of your image analysis or download them. Just be sure to check out the rights and restrictions information about how you can distribute or display them.

The code below will access and save the images located at the URLs you saved earlier. It saves them to a directory, so if you haven't already created a directory where you want to save the files, do that now.

In [10]:
!mkdir images

In [11]:
import os

def get_image_files(image_urls_list, path):
    '''
    Takes as input a list of image URLs (as returned by get_image_urls) and 
    a path to a directory in which to save image files, e.g. "data". 
    '''    
    for count, url in enumerate(image_urls_list):
        if count % 100 == 0:
            print("at item {0}".format(count))
        try:
            # create a filename based on the last part of the URL
            filename = url.split('/')[-1]
            filename = os.path.join(path, filename)        
            # request the image and write to path
            full_url = "https:{0}".format(url)
            image_response = requests.get(full_url, stream=True)
            with open(filename, 'wb') as fd:
                for chunk in image_response.iter_content(chunk_size=100000):
                    fd.write(chunk)
        # requests raises its own ConnectionError, which the built-in ConnectionError does not catch
        except requests.exceptions.ConnectionError as e:
            print(e)

In [12]:
get_image_files(image_urls, "images")

at item 0
at item 100
at item 200
at item 300
at item 400
at item 500
at item 600
at item 700
at item 800
at item 900
at item 1000
at item 1100
at item 1200
at item 1300
at item 1400
at item 1500
at item 1600
at item 1700
at item 1800
at item 1900
at item 2000


Let's see what we downloaded.

In [13]:
! ls -la images

total 1986832
drwxr-xr-x  2087 lwrubel  405867214   70958 Feb 12 09:22 .
drwxr-xr-x    23 lwrubel  405867214     782 Feb 12 09:20 ..
-rw-r--r--     1 lwrubel  405867214  433209 Feb 12 09:09 0001fv.jpg#h=1024&w=575
-rw-r--r--     1 lwrubel  405867214  392925 Feb 12 09:09 0002fv.jpg#h=1024&w=569
-rw-r--r--     1 lwrubel  405867214  405105 Feb 12 09:09 0003fv.jpg#h=1024&w=538
-rw-r--r--     1 lwrubel  405867214  403250 Feb 12 09:09 0004fv.jpg#h=1024&w=564
-rw-r--r--     1 lwrubel  405867214  395550 Feb 12 09:09 0005fv.jpg#h=1024&w=539
-rw-r--r--     1 lwrubel  405867214  401391 Feb 12 09:09 0006fv.jpg#h=1024&w=546
-rw-r--r--     1 lwrubel  405867214  339776 Feb 12 09:09 0007fv.jpg#h=1024&w=561
-rw-r--r--     1 lwrubel  405867214  370347 Feb 12 09:09 0008fv.jpg#h=1024&w=567
-rw-r--r--     1 lwrubel  405867214  405582 Feb 12 09:09 0009fv.jpg#h=1024&w=561
-rw-r--r--     1 lwrubel  405867214  561813 Feb 12 09:09 0010fv.jpg#h=1024&w=615
-rw-r--r--     1 lwrubel  405867214
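
As the listing shows, the saved filenames keep the ``#h=...&w=...`` fragment from the URL. If you would rather have plain ``.jpg`` names, one option is to take the filename from the URL path only. A minimal sketch; `filename_from_url` is a hypothetical helper, not part of the code above.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(image_url):
    """Return the last path segment of a URL, ignoring any query or fragment."""
    return posixpath.basename(urlsplit(image_url).path)

print(filename_from_url("//cdn.loc.gov/service/pnp/bbc/0000/0000/0001fv.jpg#h=1024&w=575"))
```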

This method works for situations where you need a batch of images from one collection. However, if your images come from more than one collection, two files might share a name, and the later download would overwrite the earlier one. Either way, these filenames don't tell you anything about the item, so you might need to look up further metadata. Here's an alternative approach that names each file with the identifier used on the loc.gov website. We'll re-fetch the image URL for each item and download the file, naming it with the identifier.

In [18]:
from urllib.parse import urlparse

def get_and_save_images(results_url, path):
    '''
    Takes as input the URL for the collection or results set,
    e.g. https://www.loc.gov/collections/baseball-cards,
    and a path to a directory in which to save image files.
    Follows pagination until all pages have been processed.
    '''
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(results_url, params=params)
    data = call.json()
    results = data['results']
    for result in results:
        # don't try to get images from the collection-level result or web page results
        if "collection" not in result.get("original_format") and "web page" not in result.get("original_format"):
            if result.get("image_url"):
                image = "https:" + result.get("image_url")[-1]
                # create a filename that's the identifier portion of the item URL
                identifier = urlparse(result["id"])[2].rstrip('/')
                identifier = identifier.split('/')[-1]
                filename = "{0}.jpg".format(identifier)
                filename = os.path.join(path, filename)
            
                # request the image and write to path
                image_response = requests.get(image, stream=True)
                with open(filename, 'wb') as fd:
                    for chunk in image_response.iter_content(chunk_size=100000):
                        fd.write(chunk)
    
    if data["pagination"]["next"] is not None: # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        print("getting next page: {0}".format(next_url))
        get_and_save_images(next_url, path)
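
To see what the identifier step in the function above produces, here it is in isolation. This is a small sketch under the assumption that the ``id`` field holds an item URL ending in the identifier, as in the examples in this notebook; `identifier_from_id` is a hypothetical name.

```python
from urllib.parse import urlparse

def identifier_from_id(item_id):
    """Return the last path segment of an item URL, e.g. '2007685715'."""
    path = urlparse(item_id).path.rstrip("/")
    return path.split("/")[-1]

print(identifier_from_id("https://www.loc.gov/item/2007685715/"))
```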

In [19]:
!mkdir images-named

In [20]:
get_and_save_images("https://www.loc.gov/collections/baseball-cards/", "images-named")

getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=2
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=3
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=4
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=5
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=6
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=7
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=8
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=9
getting next page: https://www.loc.gov/collections/baseball-cards/?at=results,pagination&c=100&fo=json&sp=10
getting next page: https://

Let's take a quick look at the path for the directory to confirm that the images were downloaded. 

In [21]:
!ls -lah images-named/

total 1986832
drwxr-xr-x  2087 lwrubel  405867214    69K Feb 12 09:41 .
drwxr-xr-x    24 lwrubel  405867214   816B Feb 12 09:42 ..
-rw-r--r--     1 lwrubel  405867214   332K Feb 12 09:27 2007677698.jpg
-rw-r--r--     1 lwrubel  405867214   396K Feb 12 09:27 2007677699.jpg
-rw-r--r--     1 lwrubel  405867214   386K Feb 12 09:27 2007678537.jpg
-rw-r--r--     1 lwrubel  405867214   392K Feb 12 09:27 2007678538.jpg
-rw-r--r--     1 lwrubel  405867214   423K Feb 12 09:27 2007678540.jpg
-rw-r--r--     1 lwrubel  405867214   384K Feb 12 09:27 2007678541.jpg
-rw-r--r--     1 lwrubel  405867214   396K Feb 12 09:27 2007678542.jpg
-rw-r--r--     1 lwrubel  405867214   394K Feb 12 09:27 2007678545.jpg
-rw-r--r--     1 lwrubel  405867214   362K Feb 12 09:27 2007680699.jpg
-rw-r--r--     1 lwrubel  405867214   529K Feb 12 09:27 2007680715.jpg
-rw-r--r--     1 lwrubel  405867214   509K Feb 12 09:27 2007680716.jpg
-rw-r--r--     1 lwrubel  405867214   526K Feb 12 09:27 2007680717

### Connecting the image file to the metadata
The filename the code creates is the item's identifier, so you can reconstruct a URL for the item's metadata. For example, to examine the metadata for one of the downloaded files, 2007685715.jpg, you can add ``https://www.loc.gov/item/`` before the identifier.

``https://www.loc.gov/item/2007685715``

You can also request the metadata in JSON format by adding ``?fo=json&at=item`` at the end. 
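
The two steps above can be sketched as a small helper that turns a saved filename back into the item page URL and the JSON request URL. `item_urls_for_file` is a hypothetical name, and the sketch assumes the files were named as ``<identifier>.jpg``.

```python
import os

def item_urls_for_file(filename):
    """Return (item page URL, JSON metadata URL) for a file named <identifier>.jpg."""
    identifier = os.path.splitext(os.path.basename(filename))[0]
    item_url = "https://www.loc.gov/item/{0}/".format(identifier)
    return item_url, item_url + "?fo=json&at=item"

page_url, json_url = item_urls_for_file("images-named/2007685715.jpg")
print(page_url)
print(json_url)
```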

In [22]:
r = requests.get("https://www.loc.gov/item/2007685715/?fo=json")
r_data = r.json()
print(json.dumps(r_data["item"], indent=2))

{
  "format": [
    {
      "photo, print, drawing": "https://www.loc.gov/search/?fa=original_format:photo,+print,+drawing&fo=json"
    }
  ],
  "subject_headings": [
    "Marquard, Rube (Team member)",
    "New York Giants",
    "New York",
    "National League",
    "pitcher"
  ],
  "id": "2007685715",
  "medium": [
    "1 print : relief with halftone, color."
  ],
  "marc": "//www.loc.gov/pictures/item/2007685715/marc/",
  "other_formats": [
    {
      "link": "https://lccn.loc.gov/2007685715/marcxml",
      "label": "MARCXML Record"
    },
    {
      "link": "https://lccn.loc.gov/2007685715/mods",
      "label": "MODS Record"
    },
    {
      "link": "https://lccn.loc.gov/2007685715/dc",
      "label": "Dublin Core Record"
    }
  ],
  "sort_date": "1912",
  "summary": "",
  "rights": [
    "<p>The Library of Congress does not&nbsp;own rights to material  in its collections. Therefore, it does not license or charge permission fees  for use of such material and cannot grant or d

## Conclusion

Access to batches of images opens up the door to new forms of analysis, research, and experimentation. 

We've shown where you can find information about rights and restrictions on use of images, how to find the image URLs in the API response, and how to save images if your analysis requires having the files on disk. 

Best of luck with your searching and let us know how your project goes! You can find us at [@LC_Labs on Twitter](https://www.twitter.com/LC_Labs) or email us at ndi@loc.gov.